feat(webhook): dedicated sharded apply + plan comments (per-shard, group-by-variant)#546
Merged
Merged
Conversation
…ariant)
A sharded apply fans out across the shards of one keyspace within a single
deployment, but the comment renderer reused the multi-deployment layout, whose
unit is the deployment. Every per-shard operation shares the one deployment
("cake"), so all shards rendered as identical "cake" rows, and — because the
per-deployment Details map is keyed by deployment name — the per-shard detail
collided and a failed shard's error was dropped (it showed a pending sibling's
body instead). That is why a failed rollout surfaced no error.
Add a dedicated sharded layout:
- pkg/webhook/templates/sharded_apply.go: RenderShardedApplyComment. Shards are
grouped by change signature — a uniform apply renders one Shard|Status table
with the DDL once; divergent shards (different tables, or the same table
computing to different DDL because shards drifted) render one group per
distinct change set, each tying its shards' statuses to the DDL it runs. The
first failed shard's error is lifted to the top and shown in its row.
- pkg/webhook/sharded_apply.go: detect a sharded apply (work ops keyed
namespace/shard/table, one deployment), and build the comment input from the
operation rows + tasks. Per-shard status is derived through pkg/presentation
with the shard name as the operation identity, so ordering labels reference
shards ("waiting for -40", "halted — -40 failed") with no presentation change.
formatApply{Status,Summary}Comment route sharded applies here.
Finalizer (VSchema) operations are out of scope for the shard view for now;
their outcome is still reflected in the aggregate headline state.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
Adds a dedicated PR-comment renderer for sharded applies (per-shard fan-out within a single deployment), fixing the prior multi-deployment layout’s shard-collision behavior and enabling correct per-shard status + error surfacing, including divergent shard variants.
Changes:
- Introduces a new sharded-apply comment template that groups shards by change signature and surfaces the first shard failure prominently.
- Adds sharded-apply detection + data projection in the webhook layer, routing status/summary comments to the new renderer.
- Adds unit tests covering sharded key parsing/detection, sharded comment rendering (uniform/divergent), and error surfacing.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| TEMPLATES.md | Updates documented template example headings/metadata formatting. |
| pkg/webhook/templates/sharded_apply.go | New sharded-apply PR comment renderer (group-by-variant, shard-unit tables, first-failure callout). |
| pkg/webhook/templates/sharded_apply_test.go | Tests for sharded-apply template rendering behavior (uniform/divergent, error surfacing). |
| pkg/webhook/sharded_apply.go | Sharded-apply detection and projection into template input (per-shard status via presentation). |
| pkg/webhook/sharded_apply_test.go | Tests for sharded key parsing/detection and sharded status comment output regression coverage. |
| pkg/webhook/multi_apply.go | Routes sharded applies to the new sharded template for both status and terminal summary comments. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
The sharded layout only recognised the terminal Failed state, so an apply that SchemaBot is auto-retrying (failed_retryable) lost its error and its next-action guidance. Mirror the single-deployment renderer: - isShardFailureState covers Failed and FailedRetryable; the first-failure callout and the per-shard status cell both surface the error for either. - writeShardedFooter adds the FailedRetryable case with the "stop retrying" command, matching writeApplyFooter. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The plan comment was keyspace-level: a divergent sharded plan (shards needing different changes — drift, or a partially-applied canary) collapsed to one DDL block, so an operator couldn't see what would apply where. The per-shard plan data block#535 persists never reached the comment — apitypes.PlanResponse carried only the namespace-level Changes. - apitypes.PlanResponse gains Shards ([]ShardPlanResponse), and planResponseFromProto carries them from the proto plan response. - KeyspaceChangeData gains per-shard Shards; buildPlanCommentData threads them. - writeKeyspaceChanges groups a sharded keyspace's shards by change signature: uniform shows the DDL once; divergent renders "what applies where" — one DDL block per distinct change set, labelled with the shards it applies to. No gap change needed: the per-shard data already flows from gap's Strata engine through the proto plan response; only the control-plane rendering changes. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
buildShardedApplyData took only the first task's DDL for an operation. An operation can carry more than one task for its (namespace, shard, table) when a shard plan yields multiple statements for the same table, so this dropped statements and corrupted the change signature used to group shards. Join every non-empty task DDL in task order instead. Covered by TestBuildShardedApplyData_JoinsMultiTaskDDL. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
- isShardedApply now also requires every shard work operation to share one namespace. A multi-keyspace apply (still one deployment) would otherwise route to the sharded layout and mislabel the keyspace / mix unrelated shard groups; it now falls back to the existing layout. - countChanges counts a keyspace's per-shard statements when the collapsed namespace-level Statements are absent (keyspaceStatementCount), so a sharded plan whose only DDL is per-shard is not miscounted as "no changes" and gets correct summary counts. - Tests for multi-keyspace fallback and per-shard-only plan counting. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…n apply Apply comment: a remote failure records its error on the operation's task, and the operator may not stamp it onto the operation row — so the sharded apply comment showed a failed shard with no error (forcing a trip to logs). aggregateShardState now falls back to the first task error when the operation row carries none, so a failed shard always shows why. Plan comment: an unsafe change confined to one shard (e.g. a column drop on a single drifted shard) is now flagged with that shard. shardedUnsafeChanges derives unsafe changes from the per-shard plan (grouped by table+reason → shards) when present, since the collapsed namespace-level Changes can omit a shard's change; the unsafe-changes warning names the shard(s). Tests: apply falls back to task error; plan surfaces a per-shard unsafe change scoped to its shard (both the builder and the rendered warning). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Copilot fixes:
- parseShardOperationKey rejects keys with extra segments (strict 3-part split),
so "ns/-40/table/extra" is no longer misclassified as shard work.
- buildShardedApplyData sorts each operation's tasks by id before joining DDL,
so the joined DDL and the change signature are deterministic (tasks arrive
created_at DESC). In practice an operation has one task — multiple statements
for one table are combined into one ALTER upstream — so this is defensive.
- planResponseFromProto drops shard plans with no changes, and buildPlanCommentData
skips per-shard entries with no DDL, so an empty shard never renders a blank
section or inflates the change count (a shard is changing iff it has changes).
Control-plane fail-closed guard (requested):
- dispatchPendingApply refuses to dispatch a shard work operation
("namespace/shard/table") that resolves no target shard, with a clear error
naming the skew, instead of sending an empty TargetShards that the data plane
rejects opaquely as "expected exactly one target shard, got 0".
Also correct the per-shard unsafe test/preview: a same-table change is one
combined ALTER, not multiple statements (multi-statement-per-table is unsupported).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The sharded apply/plan comments weren't represented in the generated TEMPLATES.md catalog. Add preview renderings (pkg/webhook/templates/ preview_sharded.go) for the divergent plan, the per-shard unsafe-change plan, and the in-progress / failed / divergent apply comments — per-shard statuses derived through pkg/presentation so they match production — wire a comment_sharded_all aggregate preview type, and add a "Sharded Apply" section to update-templates.sh. Regenerated TEMPLATES.md. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
aparajon
approved these changes
Jun 26, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What
Gives sharded schema changes their own PR-comment layout — for both the apply and the plan comment — instead of reusing the multi-deployment one. A sharded change (from the per-shard fan-out in #535) runs one operation per
(shard, table)across the shards of a single keyspace within one deployment, so the unit is the shard, not the deployment.The bug this fixes (apply comment)
The status comment reused the multi-deployment renderer. Every per-shard operation shares the one deployment (
cake), so:cakerows (Deployments: 3 halted, 1 failed), andDetailsmap is keyed by deployment name, so the shards collided onto one entry — the failed shard's detail was overwritten by a pending sibling's, and its error was dropped. That's why a failed sharded rollout surfaced no error.Changes
Apply comment —
RenderShardedApplyComment(pkg/webhook/templates/sharded_apply.go)Shards are grouped by change signature: uniform → one
Shard | Statustable + DDL once; divergent (different tables, or the same table computing to different DDL because shards drifted) → one group per change set, each tying its shards' statuses to the DDL it runs. The first failed/retrying shard's error is lifted to the top and shown in its row.failed_retryableis handled like the single-deployment renderer (error surfaced + "stop retrying" footer).Plan comment —
writeKeyspaceChanges(pkg/webhook/templates/plan.go)A sharded keyspace now renders what applies where: shards grouped by change signature, uniform shows the DDL once, divergent shows one DDL block per distinct change set labelled with its shards. This needed threading the per-shard plan through
apitypes.PlanResponse.Shards(it previously stopped at the namespace-levelChanges) viaplanResponseFromProto→buildPlanCommentData.Detection / projection (
pkg/webhook/sharded_apply.go)isShardedApplyrecognises the fan-out (work ops keyednamespace/shard/table, one deployment). Per-shard status is derived throughpkg/presentationwith the shard name as the operation identity, so ordering labels reference shards ("waiting for-40", "halted —-40failed") with no change to the presentation package.formatApply{Status,Summary}Commentroute sharded applies to the new renderer.No gap change required
The per-shard data already flows from gap's Strata engine through the proto plan/apply responses (#535); only the control-plane rendering changes here.
Tests
pkg/webhook+pkg/webhook/templates+pkg/api+pkg/apitypessuites pass. Covers: sharded detection / key parsing; a failed sharded apply surfacing the error (the regression);failed_retryablehandling; uniform vs divergent apply rendering; divergent vs uniform plan rendering; and per-shard plan data threading throughbuildPlanCommentData.Not in this PR (follow-up)
group_finalizeroperations are omitted from the shard view; their outcome still shows in the aggregate headline state.🤖 Generated with Claude Code